Google Play Store Data Analysis Report
In this report, we use the known features of the data set to predict app Reviews (a continuous target, using linear models) and Installs (a discrete target, using tree-based models).
# start from a clean workspace
rm(list=ls())
# import packages
library(MASS)
library(readr)
library(ggplot2)
library(corrplot)
library(Amelia)
library(reshape2)
library(caret)
library(caTools)
library(dplyr)
library(tidyr)
library(plotly)
library(texreg)
library(leaps)
library(rpart)
library(rpart.plot)
library(e1071)
google_data <- read_csv('googleplaystore.csv')
head(google_data)
## # A tibble: 6 x 13
## App Category Rating Reviews Size Installs Type Price `Content Rating`
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Phot~ ART_AND~ 4.1 159 19 10000 Free 0 Everyone
## 2 Colo~ ART_AND~ 3.9 967 14 500000 Free 0 Everyone
## 3 "U L~ ART_AND~ 4.7 87510 8.7 5000000 Free 0 Everyone
## 4 Sket~ ART_AND~ 4.5 215644 25 50000000 Free 0 Teen
## 5 Pixe~ ART_AND~ 4.3 967 2.8 100000 Free 0 Everyone
## 6 Pape~ ART_AND~ 4.4 167 5.6 50000 Free 0 Everyone
## # ... with 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current
## # Ver` <chr>, `Android Ver` <chr>
# Draw the data missmap
missmap(google_data, legend=FALSE)
Most of the missing values are in the ‘Rating’ column, so we use the drop_na() function to drop every row that contains an NA.
# drop the null data from the original data set
google_data <- drop_na(google_data)
# see first few lines of data
head(google_data)
## # A tibble: 6 x 13
## App Category Rating Reviews Size Installs Type Price `Content Rating`
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Phot~ ART_AND~ 4.1 159 19 10000 Free 0 Everyone
## 2 Colo~ ART_AND~ 3.9 967 14 500000 Free 0 Everyone
## 3 "U L~ ART_AND~ 4.7 87510 8.7 5000000 Free 0 Everyone
## 4 Sket~ ART_AND~ 4.5 215644 25 50000000 Free 0 Teen
## 5 Pixe~ ART_AND~ 4.3 967 2.8 100000 Free 0 Everyone
## 6 Pape~ ART_AND~ 4.4 167 5.6 50000 Free 0 Everyone
## # ... with 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current
## # Ver` <chr>, `Android Ver` <chr>
# see the data type of the data set
str(google_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 7683 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19" "14" "8.7" "25" ...
## $ Installs : chr "10000" "500000" "5000000" "50000000" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last Updated : chr "7-Jan-18" "15-Jan-18" "1-Aug-18" "8-Jun-18" ...
## $ Current Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
Because the data set was imported directly from the .csv file, several columns have the wrong data type; we correct them manually.
# change data type of the original data set
google_data$App <- as.character(google_data$App)
google_data$Reviews <- as.numeric(as.character(google_data$Reviews))
google_data$Price <- as.numeric(as.character(google_data$Price))
google_data$Size <- as.numeric(google_data$Size)
google_data$`Current Ver` <- as.character(google_data$`Current Ver`)
google_data$`Android Ver` <- as.character(google_data$`Android Ver`)
google_data$Category <- as.factor(google_data$Category)
google_data$Genres <- as.factor(google_data$Genres)
google_data$`Content Rating` <- as.factor(google_data$`Content Rating`)
google_data$Installs <- as.numeric(google_data$Installs)
# drop the 'Type' column, since it is redundant with Price (Price = 0 means Free)
google_data <- google_data[-which(names(google_data)=='Type')]
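The per-column coercions above could also be written more compactly; a sketch of an equivalent form, assuming a recent dplyr (>= 1.0) that provides `across()`:

```r
# equivalent compact form of the coercions above (illustrative sketch)
google_data <- google_data %>%
  mutate(
    across(c(Reviews, Price, Size, Installs), ~ as.numeric(as.character(.x))),
    across(c(Category, Genres, `Content Rating`), as.factor)
  ) %>%
  select(-Type)  # drop 'Type', as in the indexing version above
```

Either style gives the same result; the `across()` form just avoids repeating `google_data$` for every column.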
# get a summary information of the data set
summary(google_data)
## App Category Rating
## Length:7683 FAMILY :1602 Min. :1.000
## Class :character GAME : 966 1st Qu.:4.000
## Mode :character TOOLS : 627 Median :4.300
## MEDICAL : 324 Mean :4.174
## LIFESTYLE : 279 3rd Qu.:4.500
## PERSONALIZATION: 278 Max. :5.000
## (Other) :3607
## Reviews Size Installs
## Min. : 1 Min. : 0.023 Min. :1.000e+00
## 1st Qu.: 107 1st Qu.: 6.000 1st Qu.:1.000e+04
## Median : 2310 Median : 16.000 Median :1.000e+05
## Mean : 296209 Mean : 36.987 Mean :8.459e+06
## 3rd Qu.: 38825 3rd Qu.: 37.000 3rd Qu.:1.000e+06
## Max. :44893888 Max. :994.000 Max. :1.000e+09
##
## Price Content Rating Genres
## Min. : 0.000 Adults only 18+: 2 Tools : 627
## 1st Qu.: 0.000 Everyone :6138 Entertainment: 446
## Median : 0.000 Everyone 10+ : 316 Education : 417
## Mean : 1.132 Mature 17+ : 363 Medical : 324
## 3rd Qu.: 0.000 Teen : 863 Action : 321
## Max. :400.000 Unrated : 1 Lifestyle : 278
## (Other) :5270
## Last Updated Current Ver Android Ver
## Length:7683 Length:7683 Length:7683
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
# since some feature names contain spaces, we rename them
data <- rename(google_data, content_rating = `Content Rating`, Android_ver = `Android Ver`, current_ver = `Current Ver`)
head(data)
## # A tibble: 6 x 12
## App Category Rating Reviews Size Installs Price content_rating Genres
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Phot~ ART_AND~ 4.1 159 19 10000 0 Everyone Art &~
## 2 Colo~ ART_AND~ 3.9 967 14 500000 0 Everyone Art &~
## 3 "U L~ ART_AND~ 4.7 87510 8.7 5000000 0 Everyone Art &~
## 4 Sket~ ART_AND~ 4.5 215644 25 50000000 0 Teen Art &~
## 5 Pixe~ ART_AND~ 4.3 967 2.8 100000 0 Everyone Art &~
## 6 Pape~ ART_AND~ 4.4 167 5.6 50000 0 Everyone Art &~
## # ... with 3 more variables: `Last Updated` <chr>, current_ver <chr>,
## # Android_ver <chr>
We intend to analyze the data from the following aspects:
# draw the ggplot image
# image_2 is the distribution of app ratings
image_2 <- ggplotly(ggplot(google_data, aes(x=Rating)) + geom_area(stat="bin",fill='#1E90FF') +geom_vline(xintercept = mean(google_data$Rating),col='red',lty=3,lwd = 1 )+xlab("Rating Score")+ylab("Number of Apps") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.y = element_text(angle=90,hjust=1)))
# image_3 is the distribution of review counts
image_3 <- ggplotly(ggplot(google_data, aes(x=Reviews)) + geom_area(stat="bin",fill='#98FB98') +geom_vline(xintercept = mean(google_data$Reviews),col='red',lty=3,lwd = 1 )+xlab("Review Count")+ylab("Number of Apps") + ggtitle("Distribution of Different Features") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.y = element_text(angle=90,hjust=1)))
# image_4 is the distribution of app sizes
image_4 <- ggplotly(ggplot(google_data, aes(x=Size)) + geom_area(stat="bin",fill='#DAA520') +geom_vline(xintercept = mean(google_data$Size),col='red',lty=3,lwd = 1 )+xlab("Size (MB)")+ylab("Number of Apps") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1),axis.text.y = element_text(angle=90,hjust=1)))
# image_5 is the distribution of install counts
image_5 <- ggplotly(ggplot(google_data, aes(x=Installs)) + geom_area(stat="bin",fill='#D2691E') +geom_vline(xintercept = mean(google_data$Installs),col='red',lty=3,lwd = 1 )+xlab("Installs")+ylab("Number of Apps") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1),axis.text.y = element_text(angle=90,hjust=1)))
subplot(image_2,image_3,image_4,image_5,nrows=2, margin = 0.05)
Finding
# attach the data set so columns can be referenced directly
attach(google_data)
## The following objects are masked from google_data (pos = 3):
##
## Android Ver, App, Category, Content Rating, Current Ver,
## Genres, Installs, Last Updated, Price, Rating, Reviews, Size
# compute the average Rating score by group
cat_data <- group_by(google_data,Category)
data <- summarise(cat_data,count = n(),rating_score = mean(Rating))
# sort the data by descending order
data <- data[order(data$rating_score,decreasing = TRUE),]
# image_6 is the average rating grouped by 'Category'
image_6 <- ggplot(data = data,aes(x = Category,y = rating_score,fill=Category)) + geom_bar(stat = 'identity', position = 'dodge')+theme(axis.text.x = element_text(angle = 90, hjust = 1))+geom_hline(aes(yintercept=4), colour="white", linetype="dashed")+theme(panel.border = element_blank())+ coord_cartesian(ylim=c(3.5, 4.5)) +xlab("Categories in Google Play Store")+ylab("Rating")+ ggtitle("Rating in Different Categories") +theme(plot.title = element_text(hjust = 0.5))+scale_fill_discrete(name="Category")
ggplotly(image_6)
Finding
# calculate total reviews grouped by 'Category'
cat_data <- group_by(google_data,Category)
cat_data <- summarise(cat_data,count = n(),reviews = sum(Reviews))
# 'p' represents 'Reviews' grouped by 'Category'
p <- google_data %>%
plot_ly(
x = ~Category,
y = ~Reviews,
split = ~Category,
type = 'violin',
box = list(
visible = T
),
meanline = list(
visible = T
)
) %>%
layout(
title = "Reviews in Different Categories",
xaxis = list(
title = "Categories in Google Play Store"
),
yaxis = list(
title = "Reviews",
zeroline = F
)
)
p
Finding
# fitted line of 'Rating' against 'Reviews'
line_1 <- lm(Rating~Reviews,data = google_data)
new <- data.frame(google_data$Reviews)
y <- predict(line_1,newdata = new)
p <- plot_ly(data = google_data, x = ~google_data$Reviews, y = ~google_data$Rating,type = 'scatter',name='Actual Points') %>%add_trace(y = ~y, name = 'Linear Regression', mode = 'lines')%>%
layout(
title = "Rating vs Reviews",
xaxis = list(
title = "Reviews"
),
yaxis = list(
title = "Rating",
zeroline = F
)
)
p
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Finding
# encode the factor columns as numeric and keep only the numeric columns
google_data$Category <- as.numeric(google_data$Category)
google_data$Genres <- as.numeric(google_data$Genres)
google_data$`Content Rating` <- as.numeric(google_data$`Content Rating`)
google_num <- select_if(google_data, is.numeric)
Claim: the average value of ‘Rating’ equals 4.
# test whether the average value of 'Rating' equals 4
t.test(Rating, mu = 4, alternative = "two.sided", conf.level = 0.95)
##
## One Sample t-test
##
## data: Rating
## t = 27.907, df = 7682, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
## 4.161373 4.185757
## sample estimates:
## mean of x
## 4.173565
Finding
1. The p-value of the t-test is far below 0.05 and the 95% confidence interval (4.16, 4.19) excludes 4, so we reject the claim: the mean rating is significantly different from 4 (about 4.17)
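The same decision can be read off programmatically; a minimal sketch, assuming `google_data` from above:

```r
# store the test object and inspect its components
tt <- t.test(google_data$Rating, mu = 4)
tt$p.value < 0.05   # TRUE here: reject H0 that the mean rating is 4
tt$conf.int         # the interval does not contain 4
```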
#draw the correlation map of different features in numeric columns of google_data
cor_google <- cor(google_num)
melted_cor <- melt(cor_google)
cor_image <- ggplot(data = melted_cor, aes(x=Var1, y=Var2, fill=value)) + geom_tile() +xlab("")+ylab("")+ ggtitle("The Correlation of Different Features") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
Here we re-encode some columns as numeric and visualize their correlations with the map above; the brighter a square is, the stronger the correlation between the two features.
# set a seed for reproducibility
set.seed(123)
# split the data into a train set and a test set
split = sample.split(google_num$Reviews, SplitRatio = 0.75)
train = subset(google_num, split == TRUE)
test = subset(google_num, split == FALSE)
set.seed(123)
train.control = trainControl(method = "repeatedcv",
number = 10, repeats = 3)
regsubsets.out <- regsubsets( Reviews ~ .,
data = train,
nbest = 1,
nvmax = NULL,
force.in = NULL, force.out = NULL,
method = 'forward')
summary(regsubsets.out)
## Subset selection object
## Call: regsubsets.formula(Reviews ~ ., data = train, nbest = 1, nvmax = NULL,
## force.in = NULL, force.out = NULL, method = "forward")
## 7 Variables (and intercept)
## Forced in Forced out
## Category FALSE FALSE
## Rating FALSE FALSE
## Size FALSE FALSE
## Installs FALSE FALSE
## Price FALSE FALSE
## `Content Rating` FALSE FALSE
## Genres FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
## Category Rating Size Installs Price `Content Rating` Genres
## 1 ( 1 ) " " " " " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " "
## 3 ( 1 ) " " "*" "*" "*" " " " " " "
## 4 ( 1 ) " " "*" "*" "*" " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
model_1 <- lm(Reviews ~.-Price,data=train)
# use repeated cross-validation to train the linear model
model_1_cv= train(Reviews ~.-Price , data = train, method = "lm",
trControl = train.control)
model_1_cv
## Linear Regression
##
## 5763 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 5187, 5186, 5187, 5187, 5186, 5187, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1391093 0.4606649 302535.2
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Here we split the original data set into a train set and a test set, and use forward selection to find the best subset of predictors for the linear model.
Finding
1. The selection table tells us that the best linear model includes every predictor except ‘Price’
2. The performance of the model is moderate, with a cross-validated R-squared of about 0.46
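Rather than eyeballing the selection table, the preferred subset size can be read off the `regsubsets` summary directly; a sketch, assuming `regsubsets.out` from the chunk above:

```r
# pick the candidate model by an explicit criterion (illustrative sketch)
reg_sum <- summary(regsubsets.out)
which.max(reg_sum$adjr2)               # subset size with highest adjusted R-squared
which.min(reg_sum$bic)                 # subset size preferred by BIC
coef(regsubsets.out, which.min(reg_sum$bic))  # coefficients of that candidate
```

Adjusted R-squared and BIC can disagree; BIC penalizes extra predictors more heavily.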
# since the purely additive model cannot fit the data very well, we try adding interaction terms
# there is a strong relationship between Size and Installs, so we try that interaction first
model_2 <- lm(Reviews ~ Size*Installs+`Content Rating`+Genres+Category,data=train)
# model_2_cv includes interactions between features
model_2_cv= train(Reviews ~ Size*Installs*Rating, data = google_num, method = "lm",
trControl = train.control)
model_2_cv
## Linear Regression
##
## 7683 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6915, 6914, 6915, 6915, 6914, 6914, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1199619 0.6297199 250629.3
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Finding
1. The performance of this model is far better than the previous one, with a cross-validated R-squared of about 0.63
# to compare the performance of the two models, we use the anova() function
anova(model_1,model_2)
## Analysis of Variance Table
##
## Model 1: Reviews ~ (Category + Rating + Size + Installs + Price + `Content Rating` +
## Genres) - Price
## Model 2: Reviews ~ Size * Installs + `Content Rating` + Genres + Category
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5756 1.2336e+16
## 2 5756 9.9482e+15 0 2.3882e+15
Finding
1. Because the two models have the same residual degrees of freedom, anova() cannot report an F statistic or p-value here; however, the interaction model's RSS (9.95e15) is roughly 20% lower than model 1's (1.23e16), which suggests the interaction term meaningfully improves the fit
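Since the anova() output above shows equal residual df (and thus no F-test), an information criterion offers another way to compare the fits; a minimal sketch, assuming `model_1` and `model_2` from above:

```r
# AIC-based comparison of the two fitted linear models (lower is better)
AIC(model_1, model_2)
```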
# visualize the performance of model_1_cv on the test set
pred_1 <- predict(model_1_cv, test)
results_1 <- data.frame(actual = test$Reviews, pred_1 = pred_1)
image_3 <- ggplot(data = results_1, aes(x = actual, y = pred_1)) + geom_point() + geom_abline(slope = 1, intercept = 0, colour = 'orange') + xlab("Actual Values") + ylab("Predicted Values") + theme_bw()
ggplotly(image_3)
# visualize the performance of model_2_cv on the test set
pred_2 <- predict(model_2_cv, test)
results_2 <- data.frame(actual = test$Reviews, pred_2 = pred_2)
image_4 <- ggplot(data = results_2, aes(x = actual, y = pred_2)) + geom_point() + geom_abline(slope = 1, intercept = 0, colour = 'orange') + xlab("Actual Values") + ylab("Predicted Values") + theme_bw()
ggplotly(image_4)
The two plots show the different performance of the linear models; the second (with interactions) is clearly better than the first.
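That visual impression can be quantified with a held-out RMSE; a minimal sketch, assuming `model_1_cv`, `model_2_cv`, and `test` from above:

```r
# root-mean-squared error of each model on the test set (lower is better)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$Reviews, predict(model_1_cv, test))
rmse(test$Reviews, predict(model_2_cv, test))
```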
# since 'Installs' takes discrete values, we treat it as a classification target and predict the install bracket an app falls into (e.g. 500+, 5000+)
train <- mutate(train, Installs = as.factor(Installs))
test <- mutate(test, Installs = as.factor(Installs))
# set up a tree classification
classifier_tree = train(Installs ~Reviews+Rating+Category+Size+Genres, data = train, method = "rpart",parms = list(split = "information"),trControl=train.control,tuneLength = 10)
# visualize the tree model
plot(classifier_tree$finalModel)
text(classifier_tree$finalModel)
# use prp function to visualize the tree model
prp(classifier_tree$finalModel, box.palette = "Reds", tweak = 1.2)
# print the detailed information about tree model
print(classifier_tree)
## CART
##
## 5763 samples
## 5 predictor
## 19 classes: '1', '5', '10', '50', '100', '500', '1000', '5000', '10000', '50000', '1e+05', '5e+05', '1e+06', '5e+06', '1e+07', '5e+07', '1e+08', '5e+08', '1e+09'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 5185, 5188, 5188, 5186, 5187, 5187, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001800679 0.5220112 0.4594452
## 0.001973821 0.5218374 0.4591094
## 0.002701018 0.5193497 0.4560349
## 0.004570954 0.5142599 0.4492838
## 0.016621650 0.4804716 0.4091054
## 0.023270310 0.4593637 0.3811746
## 0.065447746 0.3982189 0.3086372
## 0.092457926 0.3617935 0.2634337
## 0.094951174 0.3201316 0.2101014
## 0.129648868 0.2828969 0.1618843
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001800679.
y_pred = predict(classifier_tree, newdata = test)
df<-data.frame(table(test$Installs, y_pred))
df <- mutate(df, Var1 = as.numeric(Var1), y_pred = as.numeric(y_pred), count = Freq)
# cor_image visualizes the confusion matrix of the decision tree model
cor_image <- ggplot(data = df, aes(x=Var1, y=y_pred, fill=count)) + geom_tile() +xlab("Actual Values")+ylab("Predicted Values")+ ggtitle("Confusion Matrix") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
error <- mean(test$Installs != y_pred) # Misclassification error
paste('Accuracy',round(1-error,4))
## [1] "Accuracy 0.534"
# refit the tree with a coarser tuning grid (tuneLength = 5) for comparison
classifier_tree = train(Installs ~Reviews+Rating+Category+Size+Genres, data = train, method = "rpart",parms = list(split = "information"),trControl=train.control,tuneLength = 5)
# visualize the tree model
plot(classifier_tree$finalModel)
text(classifier_tree$finalModel)
# use prp function to visualize the tree model
prp(classifier_tree$finalModel, box.palette = "Reds", tweak = 1.2)
# print the detailed information about tree model
print(classifier_tree)
## CART
##
## 5763 samples
## 5 predictor
## 19 classes: '1', '5', '10', '50', '100', '500', '1000', '5000', '10000', '50000', '1e+05', '5e+05', '1e+06', '5e+06', '1e+07', '5e+07', '1e+08', '5e+08', '1e+09'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 5186, 5187, 5185, 5187, 5184, 5188, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.02327031 0.4582076 0.3800382
## 0.06544775 0.3949644 0.3047400
## 0.09245793 0.3615083 0.2631028
## 0.09495117 0.3274837 0.2196920
## 0.12964887 0.2827252 0.1617018
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02327031.
y_pred = predict(classifier_tree, newdata = test)
df<-data.frame(table(test$Installs, y_pred))
df <- mutate(df, Var1 = as.numeric(Var1), y_pred = as.numeric(y_pred), count = Freq)
# cor_image visualizes the confusion matrix of the decision tree model
cor_image <- ggplot(data = df, aes(x=Var1, y=y_pred, fill=count)) + geom_tile() +xlab("Actual Values")+ylab("Predicted Values")+ ggtitle("Confusion Matrix") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
error <- mean(test$Installs != y_pred) # Misclassification error
paste('Accuracy',round(1-error,4))
## [1] "Accuracy 0.4588"
# mtry: number of features sampled at each split (sqrt of the column count is a common default)
mtry = sqrt(ncol(train))
tunegrid = expand.grid(.mtry=mtry)
metric = "Accuracy"
# set up a random forest model to predict installs
classifier_rf = train(Installs ~Reviews+Rating+Category+Size+Genres, data = train, method = "rf",
metric=metric, tuneGrid=tunegrid, trControl=train.control, tuneLength = 5)
# print detailed information about random forest model
print(classifier_rf)
## Random Forest
##
## 5763 samples
## 5 predictor
## 19 classes: '1', '5', '10', '50', '100', '500', '1000', '5000', '10000', '50000', '1e+05', '5e+05', '1e+06', '5e+06', '1e+07', '5e+07', '1e+08', '5e+08', '1e+09'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 5185, 5185, 5185, 5188, 5189, 5186, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5672261 0.5159021
##
## Tuning parameter 'mtry' was held constant at a value of 2.828427
y_pred = predict(classifier_rf, newdata = test)
# Checking the prediction accuracy
df<-data.frame(table(test$Installs, y_pred)) # Confusion matrix
df <- mutate(df, Var1 = as.numeric(Var1), y_pred = as.numeric(y_pred), count = Freq)
cor_image <- ggplot(data = df, aes(x=Var1, y=y_pred, fill=count)) + geom_tile() +xlab("Actual Values")+ylab("Predicted Values")+ ggtitle("Confusion Matrix") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
error <- mean(test$Installs != y_pred) # Misclassification error
paste('Accuracy',round(1-error,4))
## [1] "Accuracy 0.8916"
From the analysis above we have built four models; let's compare them.
1. Linear Model
- the model with interactions achieves a markedly lower RSS than the purely additive model
- it also has a higher cross-validated R-squared (0.63 vs 0.46), which means it explains more of the variance in Reviews
2. Tree-based Model
- based on the confusion matrices, the Random Forest model performs better than the Decision Tree model
- the misclassification error of the random forest (about 0.11) is far lower than that of the decision trees
- thus we conclude that the random forest is the better model
Potential Next Steps:
1. Evaluate the tree models using more metrics (e.g. per-class sensitivity and Kappa)
2. Analyze the categorical variables and check for correlations among them (maybe with NLP, if possible)
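For next step 1, caret's confusionMatrix() already reports many additional metrics; a sketch, assuming `classifier_rf` and `test` from above:

```r
# richer evaluation of the random forest classifier (illustrative sketch)
y_pred <- predict(classifier_rf, newdata = test)
cm <- confusionMatrix(y_pred, factor(test$Installs, levels = levels(y_pred)))
cm$overall  # accuracy, Kappa, and their confidence intervals
cm$byClass  # per-class sensitivity, specificity, and balanced accuracy
```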